Skip to content

Conversation

@ita-sammann
Copy link

@ita-sammann ita-sammann commented Dec 5, 2025

  • Implement "safe" mode to prevent writing data to wrong replicaset when vshard rebalance is in progress.
  • Auto switch to safe mode when rebalance process starts.
  • Manual return to fast mode.

I didn't forget about

  • Tests
  • Changelog
  • Documentation

ita-sammann and others added 5 commits December 5, 2025 13:42
Co-authored-by: Satbek Turganbayev <s.turganbayev@corp.mail.ru>
In fast mode ref/unref do nothing.
In safe mode they call bucket_refro,bucket_refrw,
bucket_unrefo,bucket_unrefrw.

Also added ref error handle from storages.
On ref error router will reset bucket, change replicaset
for single operations and retry request.
* add access to _bucket space in privillage tests
* fix select_readview_test
* enable double buckets test
* add safe_mode_disable for cartirdge reload test: test_storage
* add safe_mode_disable for test_any_vshard_call_timeout
pgroup_duplicates.test_duplicates = function(g)
t.skip_if(
not (
utils.tarantool_version_at_least(3, 1) and (g.params.backend == "config")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: implement test for other backend types. 2.11 and cartridge

Copy link
Contributor

@vakhov vakhov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I’ve gone through the changes and left a few small questions.

@ita-sammann ita-sammann force-pushed the TNTP-2109-rebalance-safe-mode branch from 4a8e4d2 to bb11b86 Compare December 9, 2025 11:28
@ita-sammann ita-sammann self-assigned this Dec 10, 2025
@ita-sammann ita-sammann added the do not merge Not ready to be merged label Dec 10, 2025
local utils = require('crud.common.utils')
local stats = require('crud.stats')
local readview = require('crud.readview')
local schema = require('crud.schema')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guys, the commits are absolutely chaotic, this is not the way, we develop open-source modules. We should rebase that branch in master to make the commit history linear, and not just merge it. So, all commits like Minor changes according to review comments will be visible to our users. You can check out, how Georgy Moiseev did it: proper commits, proper commit messages, tests in every commit (e.g. 8d7cae0).

Now, I'm forced to review more than 1k lines in one step, which is very inconvenient and increases the chanse of missed bugs. And our users won't be able to check the individual commits, if they want to. Of course, it's up to you, since I'm not the maintainer of that module, but it's just not nice to develop and merge code like that.

IMHO, the features should be properly split between commits, every commit must include the associated tests, proper commit messages and mentioning of the #448 ticket. Of course, refactoring or code moving should be in the separate commits. Between commits all of the test must pass

Copy link
Author

@ita-sammann ita-sammann Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guys, the commits are absolutely chaotic

Of course these commits will not go to master, I will re-split them before merging the PR.

Now, I'm forced to review more than 1k lines in one step

My bad. Never thought of this PR as of multiple features that can be reviewed separately.


-- @refer rebalance.router_cache_length
-- @function router_cache_length
crud.rebalance.router_cache_length = rebalance.router.cache_length

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we export router_cache_length and router_cache_last_clear_ts to public API? I understand that for router_cache_clear, which is supposed to be called by the TCM/cartridge/ATE, but I don't understand, why getters are needed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they will be useful in TCM/Cartridge to indicate that cache has actually been cleared and when did it happen.

function crud.init_router()
rawset(_G, 'crud', crud)
rawset(_G, 'crud', crud)
rebalance.metrics.enable_router_metrics()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a crutch. Will it be disabled, when user disables metrics? It seems the whole metrics part should be done in the metrics_registry

safe_mode = false,
safe_mode_enable_hooks = {},
safe_mode_disable_hooks = {},
_router_cache_last_clear_ts = fiber.time()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we initialize the timestamp with current time on module load? It may lead to bugs, we didn't really cleared the router`s cache on module load

end)
end

M.init = rebalance_init

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: let's make the code consistent with all other modules, it seems we tend to make the variable with proper name and assign methods to it .

function registry.init(opts)

if not ref_ok then

table.insert(bucket_ref_errs, bucket_refrw_err.bucket_ref_errs[1])
goto continue

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we continue on error and then just unref all of them? We should go to unref stright away

set_fast_mode()
end

if not hooks_registered then

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You won't need it, if internal.trigger will be used

end
end

set_fast_mode()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, and what if rebalance module is currently in the safe mode, then the trigger won't work and the bucket_ref_unref will remain in the fast. Or am I missing smth?

end
end

if redirect_replicaset ~= nil then

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: why can't we just change the replicaset variable, seems redirect_replicaset is not needed here

noreturn = opts.noreturn,
fetch_latest_metadata = opts.fetch_latest_metadata,
})
local function make_insert()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just make function wrappers for single bucket and multiple ones? Why do we always repeat the same code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do not merge Not ready to be merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants